Analysis of a memcpy (1 KiB) Performance Drop in x86_64 Virtual Machines

wangshuo  2021-01-14  glibc  performance

1 Problem Background

1.1 Symptom

When running memcpy on 1 KiB buffers in an x86_64 environment, the virtual machine is about 40 times slower than the physical machine.
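
A drop of this size is easy to reproduce with a simple timing loop. The following is a minimal sketch of such a loop (illustrative only: the buffer size, iteration count, and the use of wall-clock time instead of the cycle counts quoted later are arbitrary choices):

/* Minimal sketch of a 1 KiB memcpy timing loop (illustrative only;
   the measurements quoted later in this article are CPU cycles,
   this sketch reports nanoseconds per call instead).  */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SIZE  1024
#define ITERS 1000000L

static char src[SIZE], dst[SIZE];

int
main (void)
{
  struct timespec t0, t1;

  memset (src, 0x5a, SIZE);
  memcpy (dst, src, SIZE);      /* warm up */

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < ITERS; i++)
    {
      memcpy (dst, src, SIZE);
      /* Compiler barrier so the copy is not optimized away or hoisted.  */
      __asm__ __volatile__ ("" ::: "memory");
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("memcpy %d bytes: %.2f ns per call\n", SIZE, ns / ITERS);
  return 0;
}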

1.2 Software Information

Software    Version
OS          openEuler 20.03 (LTS)
kernel      4.19.90-2003.4.0.0036.oe1.x86_64
glibc       2.28
gcc         7.3.0

2 Conclusion and Solutions

2.1 Conclusion

The XML file used to start the virtual machine does not enable hyper-threading, so the L3-cache-based threshold (watermark) used by memcpy differs between the physical machine and the VM, which causes the performance gap.

2.2 Solutions

Method 1: enable hyper-threading for the VM

<cpu mode='host-passthrough' check='none'>
   ...
  <topology sockets='2' cores='4' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>

Method 2: adjust the memcpy threshold

The following is the configuration recommended by the glibc community:

# export GLIBC_TUNABLES=glibc.tune.x86_non_temporal_threshold=$(($(getconf LEVEL3_CACHE_SIZE) * 3 / 4))

3 Overview of the memcpy Algorithm

In glibc 2.28, memcpy and memmove share the same implementation. The algorithm is briefly described in the glibc source:

sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

/* memmove/memcpy/mempcpy is implemented as:
   1. Use overlapping load and store to avoid branch.
   2. Load all sources into registers and store them together to avoid
      possible address overlap between source and destination.
   3. If size is 8 * VEC_SIZE or less, load all sources into registers
      and store them together.
   4. If address of destination > address of source, backward copy
      4 * VEC_SIZE at a time with unaligned load and aligned store.
      Load the first 4 * VEC and last VEC before the loop and store
      them after the loop to support overlapping addresses.
   5. Otherwise, forward copy 4 * VEC_SIZE at a time with unaligned
      load and aligned store.  Load the last 4 * VEC and first VEC
      before the loop and store them after the loop to support
      overlapping addresses.
   6. If size >= __x86_shared_non_temporal_threshold and there is no
      overlap between destination and source, use non-temporal store
      instead of aligned store.  */

As item 6 states, once the copy size exceeds the __x86_shared_non_temporal_threshold watermark, non-temporal stores are used instead of aligned stores. The performance drop analysed here falls into exactly this scenario.
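
The dispatch can be paraphrased in C roughly as follows. This is only a sketch of the branch selection, not the real code (which is hand-written assembly); it assumes VEC_SIZE is 16 as in the SSE2 variant and uses an ordinary variable as a stand-in for __x86_shared_non_temporal_threshold:

/* Sketch: which copy strategy is chosen for a given size, following the
   six cases quoted above.  VEC_SIZE = 16 (the SSE2 variant) is assumed,
   and a plain variable stands in for __x86_shared_non_temporal_threshold.  */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define VEC_SIZE 16UL

static size_t non_temporal_threshold;   /* stays 0 unless set, as in the VM */

static const char *
select_path (uintptr_t dst, uintptr_t src, size_t n)
{
  int overlap = dst < src + n && src < dst + n;

  if (n <= 8 * VEC_SIZE)
    return "load everything into registers, then store (cases 1-3)";
  if (n > non_temporal_threshold && !overlap)
    return "non-temporal store loop ending in sfence (case 6)";
  if (dst > src)
    return "backward 4*VEC loop, unaligned load + aligned store (case 4)";
  return "forward 4*VEC loop, unaligned load + aligned store (case 5)";
}

int
main (void)
{
  /* With the threshold left at 0 (the VM default analysed below), even a
     1 KiB copy takes the non-temporal path.  */
  printf ("threshold 0     : %s\n", select_path (0x2000, 0x1000, 1024));

  non_temporal_threshold = 24UL << 20;   /* e.g. 3/4 of a 32 MiB L3 */
  printf ("threshold 24 MiB: %s\n", select_path (0x2000, 0x1000, 1024));
  return 0;
}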

4 Execution Logic

On x86, before a process starts running, glibc performs a series of initialization steps for these thresholds. The initialization relevant to this problem is:

sysdeps/x86/cacheinfo.c

      /* A value of 0 for the HTT bit indicates there is only a single
         logical processor.  */
      if (HAS_CPU_FEATURE (HTT))
        {
          ...
          /* compute threads */
          ...
        }

        ...

  /* The large memcpy micro benchmark in glibc shows that 6 times of
     shared cache size is the approximate value above which non-temporal
     store becomes faster on a 8-core processor.  This is the 3/4 of the
     total shared cache size.  */
  __x86_shared_non_temporal_threshold
    = (cpu_features->non_temporal_threshold != 0
       ? cpu_features->non_temporal_threshold
       : __x86_shared_cache_size * threads * 3 / 4);

As the code shows, in a VM without hyper-threading __x86_shared_non_temporal_threshold ends up as 0 (threads is only computed inside the HAS_CPU_FEATURE (HTT) branch, so it stays 0), whereas on the physical machine it is __x86_shared_cache_size * threads * 3 / 4. When a 1 KiB memcpy is executed, the following check decides which branch to take, and this is where the VM and the physical machine diverge.
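
To make the difference concrete, the arithmetic can be reproduced from userspace. The sketch below queries the L3 size the same way getconf LEVEL3_CACHE_SIZE does; the zero thread count models the VM case, since glibc only computes threads when the HTT feature bit is visible:

/* Sketch of the threshold arithmetic.  sysconf (_SC_LEVEL3_CACHE_SIZE) is
   what `getconf LEVEL3_CACHE_SIZE` reports; per the glibc comment quoted
   above, __x86_shared_cache_size * threads amounts to the total shared
   cache size, so the host default is roughly 3/4 of the L3 size, which is
   also what the tunable recommended in section 2.2 sets explicitly.  */
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long l3 = sysconf (_SC_LEVEL3_CACHE_SIZE);
  long threads_vm = 0;          /* HTT not visible in the VM: never computed */

  if (l3 <= 0)
    {
      fprintf (stderr, "L3 cache size not reported\n");
      return 1;
    }

  printf ("host default (approx.)       : %ld bytes\n", l3 * 3 / 4);
  printf ("VM default (threads stays 0) : %ld bytes\n", l3 * threads_vm * 3 / 4);
  return 0;
}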

sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
	/* Check non-temporal store threshold.  */
	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
	ja	L(large_backward)
#endif

The two resulting code paths are:

Physical machine path
sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

L(loop_4x_vec_backward):
	/* Copy 4 * VEC a time backward.  */
	VMOVU	(%rcx), %VEC(0)
	VMOVU	-VEC_SIZE(%rcx), %VEC(1)
	VMOVU	-(VEC_SIZE * 2)(%rcx), %VEC(2)
	VMOVU	-(VEC_SIZE * 3)(%rcx), %VEC(3)
	subq	$(VEC_SIZE * 4), %rcx
	subq	$(VEC_SIZE * 4), %rdx
	VMOVA	%VEC(0), (%r9)
	VMOVA	%VEC(1), -VEC_SIZE(%r9)
	VMOVA	%VEC(2), -(VEC_SIZE * 2)(%r9)
	VMOVA	%VEC(3), -(VEC_SIZE * 3)(%r9)
	subq	$(VEC_SIZE * 4), %r9
	cmpq	$(VEC_SIZE * 4), %rdx
	ja	L(loop_4x_vec_backward)
	/* Store the first 4 * VEC.  */
	VMOVU	%VEC(4), (%rdi)
	VMOVU	%VEC(5), VEC_SIZE(%rdi)
	VMOVU	%VEC(6), (VEC_SIZE * 2)(%rdi)
	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
	/* Store the last VEC.  */
	VMOVU	%VEC(8), (%r11)
	VZEROUPPER
	ret

Virtual machine path
sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

L(loop_large_backward):
	/* Copy 4 * VEC a time backward with non-temporal stores.  */
	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 2)
	PREFETCH_ONE_SET (-1, (%rcx), -PREFETCHED_LOAD_SIZE * 3)
	VMOVU	(%rcx), %VEC(0)
	VMOVU	-VEC_SIZE(%rcx), %VEC(1)
	VMOVU	-(VEC_SIZE * 2)(%rcx), %VEC(2)
	VMOVU	-(VEC_SIZE * 3)(%rcx), %VEC(3)
	subq	$PREFETCHED_LOAD_SIZE, %rcx
	subq	$PREFETCHED_LOAD_SIZE, %rdx
	VMOVNT	%VEC(0), (%r9)
	VMOVNT	%VEC(1), -VEC_SIZE(%r9)
	VMOVNT	%VEC(2), -(VEC_SIZE * 2)(%r9)
	VMOVNT	%VEC(3), -(VEC_SIZE * 3)(%r9)
	subq	$PREFETCHED_LOAD_SIZE, %r9
	cmpq	$PREFETCHED_LOAD_SIZE, %rdx
	ja	L(loop_large_backward)
	sfence
	/* Store the first 4 * VEC.  */
	VMOVU	%VEC(4), (%rdi)
	VMOVU	%VEC(5), VEC_SIZE(%rdi)
	VMOVU	%VEC(6), (VEC_SIZE * 2)(%rdi)
	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
	/* Store the last VEC.  */
	VMOVU	%VEC(8), (%r11)
	VZEROUPPER
	ret

5 Analysis of the Instruction Difference

As shown above, the biggest difference between the physical-machine path and the VM path is the store instruction used. The relevant macros are defined as follows:

sysdeps/x86_64/memmove.S

#define PREFETCHNT	prefetchnta
#define VMOVNT		movntdq
/* Use movups and movaps for smaller code sizes.  */
#define VMOVU		movups
#define VMOVA		movaps

So the physical-machine path uses movaps, a 16-byte-aligned store that goes through the cache, while the VM path uses movntdq, a non-temporal store that bypasses the main cache. Some background on non-temporal instructions:

https://stackoverflow.com/questions/14106477/how-do-non-temporal-instructions-work

The streaming read/write with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines are reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.

The comment supposes the following behavior (but I cannot find any references that the hardware actually does this, one would need to measure or a solid source and it could vary from hardware to hardware):

- Once the CPU sees that the store buffer is full and that it is aligned to a cache line, it will flush it directly to memory since the non-temporal write bypasses the main cache.

In other words, movntdq pushes the data out to memory while bypassing the main cache, so for copies that would otherwise stay in cache it is naturally slower than movaps.
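
The difference between the two kinds of store is easy to demonstrate with SSE2 intrinsics. The following sketch (assuming SSE2 and 16-byte-aligned buffers; it is not glibc's code) copies a 1 KiB buffer once with ordinary aligned stores (movaps via _mm_store_si128) and once with non-temporal stores (movntdq via _mm_stream_si128) followed by sfence, mirroring the two loops shown above. For a buffer this small, the streaming version is expected to be far slower when timed in a loop, because every store is pushed out towards memory instead of staying in cache:

/* Sketch: aligned stores (movaps) vs non-temporal streaming stores
   (movntdq) on a 1 KiB buffer.  Assumes SSE2; buffers are 16-byte
   aligned so the aligned-store variant is legal.  */
#include <emmintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <string.h>

#define SIZE 1024

static alignas (16) char src[SIZE];
static alignas (16) char dst[SIZE];

/* movups loads + movaps stores: data stays in the cache hierarchy.  */
static void
copy_aligned_store (char *d, const char *s, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (s + i));
      _mm_store_si128 ((__m128i *) (d + i), v);
    }
}

/* movups loads + movntdq stores followed by sfence: the stores bypass
   the cache, as in the L(loop_large_backward) path above.  */
static void
copy_stream_store (char *d, const char *s, size_t n)
{
  for (size_t i = 0; i < n; i += 16)
    {
      __m128i v = _mm_loadu_si128 ((const __m128i *) (s + i));
      _mm_stream_si128 ((__m128i *) (d + i), v);
    }
  _mm_sfence ();
}

int
main (void)
{
  memset (src, 0x5a, SIZE);
  copy_aligned_store (dst, src, SIZE);   /* cached path (host behaviour)  */
  copy_stream_store (dst, src, SIZE);    /* streaming path (VM behaviour) */
  return 0;
}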

After discussing this with the glibc community, we learned that the threshold is a deliberate trade-off: letting large memcpy operations go through the L3 cache would make memcpy itself faster, but it would evict the rest of the cache contents and hurt overall system performance, hence the watermark.

https://sourceware.org/pipermail/libc-alpha/2021-January/121510.html

> The performance of memcpy 1024 has recovered. However, there is performance
> reduce in host. This is test result (cycle):
>
>                           memcpy_10  memcpy_1k  memcpy_10k  memcpy_1m  memcpy_10m
> before backport                   8         34         187     130848     2325409
> after backport                    8         34         182     515156     5282603
> Performance improvement       0.00%      0.00%       2.67%   -293.71%   -127.17%

I think this is expected because the large copies no longer stay within the cache. This is required to avoid blowing away the entire cache contents for such large copies, negatively impacting whole system performance. This will of course not show up in a micro-benchmark.

6 Results After Adjusting the Threshold

As analysed above, the VM's default threshold is 0. Applying the community-recommended setting on both the VM and the physical machine gives the following results (in cycles):

Physical machine    memcpy_10   memcpy_1k   memcpy_10k   memcpy_1M   memcpy_10M
Before                      8          34          187      130848      2325409
After                       8          34          182      515156      5282603
Improvement             0.00%       0.00%        2.67%    -293.71%     -127.17%

Virtual machine     memcpy_10   memcpy_1k   memcpy_10k   memcpy_1M   memcpy_10M
Before                      8        1269         4555      523740      5304273
After                       8          35          183      509297      5260913
Improvement             0.00%      97.24%       95.98%       2.76%        0.82%
Comparing the numbers before and after, the VM and the physical machine perform the same once the threshold is adjusted. Note that for the physical machine the threshold was previously __x86_shared_cache_size * threads * 3 / 4 and is now effectively __x86_shared_cache_size * 3 / 4; with this lower threshold, copies switch to the movntdq path sooner, which explains the regression for sizes of 1 MiB and above.
